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Abstract 

The majority of the approaches to the automatic recovery of a panoramic image from a set of partial views 
are suboptimal in the sense that the input images are aligned, or registered, pair by pair, e.g., consecutive frames 
of a video clip. These approaches lead to propagation errors that may be very severe, particularly when dealing 
with videos that show the same region at disjoint time intervals. Although some authors have proposed a post- 
processing step to reduce the registration errors in these situations, there have not been attempts to compute the 
optimal solution, i.e., the registrations leading to the panorama that best matches the entire set of partial views. 
This is our goal. In this paper, we use a generative model for the partial views of the panorama and develop an 
algorithm to compute in an efficient way the Maximum Likelihood estimate of all the unknowns involved: the 
parameters describing the alignment of all the images and the panorama itself. 
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I. INTRODUCTION 



Very good paper to read. Technically sound, very clear, and very good results. My only minor comment 
is that the statement that no attempt has been made to find the optimal solution is not true. There are 
■ many papers tackling this issue, e.g. work of Pollefeys, zisserman's group at oxford, some work at INRIA 
"xl" ■ by various groups, the work of Peleg et al., etc. 

In this paper, we address the problem of recovering, in an automatic way, a panoramic image, or 
^ ; a mosaic, from a set of uncalibrated partial views, e.g., a set of video frames. Modern digital video 
• systems demand efficient solutions for this problem, e.g., for image stabilization [5], [12] and content- 
O ! based representations [1]. Other application fields include virtual reality and remote sensing. The key step 
T"! ! to the success of the automatic mosaic building is the accurate registration, or alignment, of the input 
^ images. 

'x : 

c3 A. Related work 

Although some authors have approached the registration problem using classical signal processing tech- 
niques, such as Fourier transforms [15], or current image analysis tools, such as integral projections [10], 
the majority of the papers in the literature are mostly distinguished by either requiring a low-level pre- 
processing step (feature-based methods) or attempting to register the images directly from their intensity 
levels (featureless methods). 

Feature-based methods, e.g., [6], align the images by first detecting and matching a set of pointwise 
features. Since reliable feature points must correspond to sharp intensity corners [16], [3], this first step 
is hard to accomplish in a fully automatic way when processing real videos, particularly when the images 
are noisy, have low texture, or exhibit a small overlap among them. 

In opposition, featureless methods are optimal, in the sense that they estimate the registration parameters 
by minimizing the difference between the image intensities in a large region, thus leading to more robust 
solutions to the registration of a pair of views, e.g., [11], [13]. However, when building a panorama from a 
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large set of images, practitioners usually register them sequentially, one at a time. This leads to propagation 
errors that may be become visually noticeable if non-consecutive images cover the same region of the 
panorama, which is common in applications such as seabed mapping. Although some authors proposed to 
post-process the registration parameters to deal with this problem [11], [7], there have not been attempts 
to generalize the highly successful featureless methods to the multi-frame case. 

B. Proposed approach: featureless global estimation 

The robustness of the featureless approaches to the registration of two views motivated us to develop 
a featureless method to align a larger set of frames. However, it is not obvious how the two-frame cost 
function, usually the sum of the image square differences [11], [13], should be generalized to the multi- 
frame case. We were able to derive the appropriate cost function, which is an original contribution of this 
paper, by including as unknown, jointly with the registration parameters, the panoramic image itself. 

Our approach in this paper is then to formulate the automatic recovery of mosaics from a set partial 
views, as a classical parameter estimation problem. The input images are modelled as noisy observations 
of limited regions of the unknown panorama. Naturally, since the images are uncalibrated, the problem 
includes as unknowns the parameters describing the registration, or alignment, of the entire set of input 
images. We then use Maximum Likelihood (ML) estimation. To minimize the ML cost with respect to the 
large set of unknowns, we propose an efficient method. First, we derive the closed-form solution for the 
estimate of the panorama in terms of the other unknowns (the registration parameters). Then, we plug-in 
the estimate of the panorama into the ML cost, obtaining an error function that depends on the registration 
parameters alone. This error function is a weighted sum of the square differences between all possible 
pairs of input images. We derive a gradient-descent algorithm to minimize this cost. 

Like in the current featureless approaches to the registration of two images [11], [13], the derivatives 
involved in the gradient-descent algorithm to minimize our ML cost, are computed in a simple way in 
terms of the image gradients. 

C. Paper organization 

The remaining of the paper is organized as follows. In section [Til we formulate the registration of 
multiple images as a classical estimation problem. Section [111] deals with ML estimation for this problem, 
i.e., it introduces the Maximum Likelihood Mosaics (MLM) approach. In section [WJ we develop MLM, 
using the simpler case of registering a pair of images. We contrast MLM with minimizing the registration 
error over a fixed window, as usually done in current featureless approaches. Section IVl generalizes MLM 
to the multi-frame registration. In section |VIJ we derive the gradient-descent algorithm to minimize the 
ML cost. Section IVIII describes experiments and section IVIIII concludes the paper. Preliminary versions 
of parts of this work are in [13], [14]. 

II. Problem Formulation 

In this section, we develop a generative model for the partial views of an unknown panorama, and use 
ML to derive the estimation criterion that will allow us to recover the observed panorama, as well as the 
registration parameters, i.e., the viewing positions. 

A. Generative model 

We model each pixel of each image L, as a noisy sample of the panorama P. For simplicity, we 
consider the image domain to be the entire plane M 2 and, to take care of the limited field of view, we 
define a window H as H(x,y) = 1 in the region observed in the images and H(x,y) = in the regions 
outside the camera field of view. The observation model is then 



Ii(x<) = [P(x )+R(x i )]H(x i ), 



(1) 
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where R denotes the noise, assumed i.i.d. zero-mean Gaussian, x ; are the image coordinates (x,y), 
expressed in the coordinate system of the generic image L, and x are the corresponding coordinates of 
the panoramic image P, expressed in its own coordinate system (which we will refer to as the reference 
coordinate system). Image models related to (Q]) have been used in the context of segmenting and tracking 
moving objects in video sequences [2], [8]. 

The reference coordinate system and the coordinate system of any of the images are related by a generic 
parametric mapping 

Xi = m(0;;x o ). (2) 

The parameter vector 0, in © determines thus the mapping between each pixel of the panorama, with 
coordinates x , expressed in the reference coordinate system, with the corresponding pixel of image Ij, 
with coordinates Xj. Common parameterizations include translation (2 degrees of freedom (dof)), rotation 
(1 dof), rigid motion (3 dof), translation+rotation+zoom (4 dof), afflne (6 dof), and the projective, or 
homography (8 dof), see, e.g., [11], [9]. Although our derivations are intentionally left fully generic, in 
the experiments, we have used the affine mapping. 



B. Estimation criterion 

Given a set of n images, {Ii, . . . , I n }, our goal is to recover all the unknowns involved: the panorama P 
and the set of parameter vectors {0 1 , . . . ,0 n } that define the viewing positions. We use ML. From 
the observation model (OQ), after simple manipulations, we express the symmetric of the log-likelihood 
function as 

L(P,0 1 ,...,0 n ) = ^ln(27ra 2 ) + (2a 2 )" 1 ^ ^T[I 4 (m(0 i; x ))-P (x )f H (m(0 i; x )) , (3) 

x elR 2 »=1 

where iV is the number of pixels in each image and a 2 is the variance of the observation noise. 



III. Maximum Likelihood Mosaics 

To compute the ML estimate of all the unknowns, i.e., to carry out the minimization of the ML cost, 
given by the symmetric log-likelihood ©, with respect to (wrt) {P, 1 , . . . , n }, we start by noticing that 
the estimate of the panorama P can be expressed in closed-form as a function of the remaining unknowns. 

We derive the expression for the ML estimate P of the panorama by minimizing ([3]) wrt a generic 
pixel value P(x ). By making zero the derivative of © wrt P(x ), the estimate P at pixel x is easily 
obtained as a function of the set of unknown registration parameters, which we will compactly denote by 
e = {0!,...,0 n }: 

~ Er=iI*(m(0 t ;x o ))H(m(0 t ;x o )) 

f (X , fc>J - =jj — — rr . (4) 

E i= i H ( m (0*; x o)) 

This expression shows that the estimate of the intensity of each pixel x of P is given by the average of 
the intensities of the corresponding pixels of all the input images that captured x , i.e., all the images Ij 
for which H (m(0 i ; x )) = 1. 



IV. Two-frame Registration 

A. Cost-function 

B. Minimization algorithm 

It is now clear that it is not possible to evaluate the error e(0, x) in © for every pair of values 
of and x. This is the main problem of using a fixed window 1Z — registration is only possible when 
the overlapping region contains 1Z. This limitation has a particular impact on the behavior of iterative 
registration algorithms. In fact, to avoid an exhaustive search, e.g., block matching, the minimization 
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of E(0) in © is usually performed by using gradient-based algorithms that iteratively optimize 0. 
Obviously, at every iteration of the algorithm, the overlapping region (which depends on the current 
estimate of 0) must contain 1Z, since the error E(0), as well as its gradient, depend on a sum over 1Z. 

As illustrated by the first experiment of section IVIH the minimum overlap requirement makes hard the 
automatic registration of arbitrary images. In fact, by specifying a priori a fixed window 1Z, we can not 
cope with all possible situations. If one specifies a small 1Z, it may fit into the true overlapping region, 
but the estimation error will be large due to the smooth minimum of E(0). On the other hand, if one 
specifies a large 1Z, it is not possible to register images which have a small overlap. Our goal here is to 
develop a method to perform the registration in the situations when the overlap between the images is 
not known a priori. 

Instead of using a fixed window 1Z, we propose an adaptive window TZa(O), defined as the largest 
region for which it is possible to evaluate the error e(0, x) as defined in ©. In our iterative optimization, 
the estimate 6 is computed by refining a previous estimate 6 , i.e., 6 = 6 + 5. The update 6 is estimated 
by minimizing the registration error over the adaptive window 7Za(0q), 



where e(0,x) is as defined in ©. The adaptive window 1Z A (6 ), whose size and shape depend on the 
current estimate O of the motion parameter vector, is the overlapping region between the image I and 
the image I' registered according to , l'(m(0o, x)). 

To compute the update 5, we develop an adaptive-window-based Gauss-Newton method. Similar meth- 
ods have been used to minimize ©, i.e., to register images using a fixed window, e.g., [11]. In this 
method, e(0,x) is approximated by its first-order truncated Taylor series expansion, e(0,x) ~ e(0 ,x) + 
6 T ■ V Qe(6 ,x.). Using this approximation in ©, and making zero the gradient of the cost function, we 
get 6 as the solution of the linear system 



where we omit the dependency of e on x and for compactness. From the definition of e in ©, we see 
that V$e in © is computed from the image gradient, V$e = — V^m • V X I'. The initial guess for 6 
is such that m(0 o ,x) is the identity mapping, which corresponds to initializing the algorithm with zero 
displacement between the images, thus the initial window IZa(Oo) is the entire image region. 

The Gauss-Newton method just described assumes the motion is small. To cope with large displace- 
ments, we use a multiresolution scheme. In such scheme, the iterative estimation algorithm is first used 
in a lower resolution versions of the input images, until a certain stopping criterium is reached . The 
resulting parameter estimates are then used as initial guesses for the parameters in the the next (higher) 
resolution and the process is repeated until the original images are used. 

Among the number of valid stopping criteria, we combine the two most obvious: i) the maximum 
number of iterations; and ii) the minimum value of the norm of the update vector 8. Using only ii) is 
not adequate in the low resolution levels, where it is only necessary to make a coarse estimation of the 
parameters. In these levels, convergence may be slow and the overall performance of the algorithm is not 
affected if we simply perform a fixed number of iterations. 

C. Impact of the window 

The motion of the brightness pattern between two images I and I' is described by the parametric 
mapping x' = m(0,x) that maps each pixel of I, with coordinates x, into the corresponding pixel x' 




(5) 




(6) 
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of I'. Featureless approaches to image registration estimate the global motion parameter vector by 
minimizing the error 

E(0) = e 2 (0, x), e(0, x) = I(x) -l'(m(0, x)), (7) 



where the sum is over a fixed, pre- specified, rectangular window 1Z. When the overlap between the images 
is large, the window 1Z is simply chosen as a large rectangle in the interior of the image(s). However, 
when the overlap is small, it is difficult to select a priori an appropriate window 1Z, due to two reasons. 
First, since it is not known beforehand where the overlapping region is, its hard to choose a location for 
the window 1Z. Second, imposing a priori a small window, leads to less accurate estimates of because 
not only the minimum of E(0) in © becomes less sharp but also the local minima phenomena become 
more severe. 




To illustrate the impact of the size of the window, we represent in Fig. Q] the typical evolution of E 
in ©, as a function of a single motion parameter 9, for several sizes of 1Z. Naturally, as anticipated 
above, the larger is 1Z, the smaller is the domain {9} in which E{9) can be evaluated. The several local 
minima and the smoothness of the minimum of E(9) at the true value 9 = 20 in the top plots, obtained 
with relatively small windows, contrast with the single sharp minimum of the bottom-right plot, obtained 
with the largest window (note that the vertical scale is different from plot to plot). 
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V. Multi-frame Registration 

A. Estimate of the registration parameters {81, . . . , n } 

Replacing the ML estimate P of the panorama, given by ©, in the symmetric log-likelihood ©, we 
express this ML cost L as a function of the unknown registration parameters © alone. After algebraic 
manipulations, we get: 

L(0) = ^\n(2na 2 ) + (4a 2 )- 1 £ W" l (xo, 0) ■ (8) 

x eiR 2 

n 

■ J2 E S ( x o> Oj) H (m(0 <; xo)) H (m(0 i; x )) , 

where is the error between the co-registered images I, ; and Ij, 

E - (x , e h Oj) = I, (m(0 <; x )) - lj (m(6> j; x )) , (9) 

and W(x o ,0) is a weight that counts the number of images that have captured the pixel x of the 
panorama, according to the registration parameters in 0, i.e., 

n 

w (x , 0) = H ( m (0*; x o)) • do) 

k=l 

By discarding from © the constant terms, i.e., the terms that do not depend on the unknown registration 
parameters 0, we conclude that the ML estimate for the problem of global multi-frame registration, is 
equivalent to the following minimization: 

e = argnga£ £ . (11) 

For simplicity, when deriving (fTT|) from ©, the sums were interchanged and the spatial region of 
summation was re-defined to take care of the windows H(-) in ©, i.e., TZ^ in (fTTI) is the region where 
the images L and L, overlap, 

Kij = {x : H (m(0;; x)) H (m(0 j; x)) = 1} . (12) 

Expressions (fTOl) and (fTT)) condense one the contributions of this paper — they show that the ML estimate 
of the registration parameters is given by the minimum of a particular weighted sum of the square 
differences between all possible pairs of co-registered input images. 

VI. MLM Algorithm 

Our algorithm to the minimization of the ML cost (fTT)) uses an iterative scheme inspired in the common 
approaches to the two-frame problem [11], [13]. In each step, the algorithm updates a current estimate 
that we denote by 0° = {0?, . . . , 0°}. 

A. Iterative minimization of the ML cost 

Instead of updating the entire set of parameters in a single step, which would be computationally 
complex, we propose a coordinatewise minimization: we update each vector q at a time, keeping fixed 
the remaining registration parameters {0j = 0°,i 7^ q). The update is 6 q = 6° q + S, where <5 is obtained 
from (fTT)) . after discarding the terms that do not depend on q : 

S = argmm^ ^ W (x o ,0°) (13) 
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To obtain a closed-form solution for the update d, we approximate the error E iq by its first-order Taylor 
series expansion, 



e 19 (x , el e» + 6)* E ig (x , el e») + s 1 ■ v e E iq (* , el e») . (14) 

From the definition of E iq in ©, the gradient in the Taylor series expansion is easily computed in terms 
of the spatial gradient of image l q . Furthermore, that gradient does not depend on 0®, thus we will denote 
it more compactly by V(x , 0°), 

v(x ,o q )=v e E iq (x ,ele° q ) d5) 

= -V 0g m(0° q ; xo) • V x I g (m(6>; ; x )) . (16) 

By inserting the Taylor series approximation in (fT3l and making zero the derivative wrt d, we get the 
update S as the solution of a linear system 

r(0°) -5 + 7(0°) =0. (17) 

The matrix r(0°) and the vector 7 (0°) are obtained as 

r ( °)=E V(x ,^)-V T (x ,^), (18) 

7 (0O) = ^V(xo,^)[p(xo,0 o )-I g (m(^;x o ))] , (19) 

where we used expression © for P. The sums in (1181191) are over the region observed by image l q , 
^={x:H(m(^;x)) = l}. 



B. Interpretation in terms of current algorithms 

Since the iterations in standard featureless two-frame alignment algorithms [11], [13] also lead to a 
system like (fTTT) . we now interpret our solution (I17I18I19I) in terms of those approaches. Define E 0l? as 
the difference between image I q and the previous estimate of the panorama, obtained with the registration 
parameters 0°, 

E 0g (x , 0°) = P (xo, 0°) - l,(m(0; ; x )) . (20) 

Since the gradient of this error wrt q is equal to the one defined in (fl~5l) . we can re-write expressions 
(1181191) in terms of E 0q , 

r (0°) = E V e q E °o ( x o> ©°) • Vg Eog (x , 0°) , (21) 
7 (0°) = V Eo, (x , 0°) E 0q (x , 0°) . (22) 

Expressions (1211221) are equal to the ones that arise from aligning the previous estimate P of the 
panorama with image I q , by using standard featureless methods, see e.g., [11], [13] or [3]. We thus conclude 
that our global approach lead to an algorithm that refines the estimate of the registration parameters of 
each image by using the methodology developed to register a single pair of images. 
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C. Convergence — initialization and multire solution 

Our algorithm starts by aligning the images sequentially, using the standard two-frame approach [11], 
[13]. Then, we compute an initial estimate of the panorama by using ©. After this, we cyclically refine 
the registrations parameters of each image. The stopping criterion may either be the error below a small 
threshold or reaching a maximum number of iterations. 

Since the truncated Taylor series is a good approximation only when the vector 6 q is close to its 
initial value 6° q , estimating the update S from (1 171 1 81 191) leads to the convergence to the globally optimal 
ML estimate, only when the initial estimate is close enough to it. However, in practice, e.g., in the first 
experiment described below, it is common that the initial estimate of the panorama is very rough, due to 
the propagation of (two-frame based) registration errors. To cope with these situations, we use a coarse- 
to-fine approach similar to the one proposed in [4], [11]: the parameters are first estimated in the coarsest 
resolution level, then used as an initialization to the next finer level, until the full image resolution is 
attained. As illustrated in the following section, this multi-resolution approach succeeds in correcting large 
miss-registrations. 

VII. Experiments 

We describe two experiments. The first experiment compares our global approach with the current 
sequential registration methods. In the second experiment, we illustrate with automatic mosaic building 
in a seabed mapping context. 

A. Adaptive window versus fixed window 

To illustrate how our method performs better than the usual fixed window method, we synthesized 
input images by cropping a real photography and adding noise. This corresponds to a simple translational 
motion model, which suffices to show the advantage of using our adaptive window method. 

In Fig. 12, the overlap between input images is large. The left image of Fig. [2] shows the failure of 
the algorithm with a fixed window of size 64. Our algorithm and the one with a fixed window of size 
128 both lead to good results, see the middle and right images of Fig. |2] Note that, although these two 
images are visually indistinguishable, the estimate of the global motion provided by our algorithm is more 
accurate because it minimizes the error over the largest possible window. 




Fig. 2. Registration of a pair of images. Left: using a fixed window of size 64 (registration failure). Middle: fixed window size 128. Right: 
our algorithm. 

In Fig. [3] the overlap between input images is small, thus it is impossible to use a fixed window of a 
large size. When using a fixed window of size 64, the usual algorithm fails, see the left image of Fig. [3] 
The right image of Fig. [3] shows that our algorithm succeeds in this challenging situation. 

Finally, Fig. |4] shows a mosaic obtained by pairwise registering, sequentially, a set of input images. 
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Fig. 3. Registration of images with very small overlap. Left: using a fixed window of size 64 (registration failure). Right: our algorithm. 




Fig. 4. Mosaic of images with small overlap. Top: original images. Bottom: mosaic built by using our algorithm to register those images. 

B. MLM versus sequential alignment 

To have an exact knowledge of the ground truth, we "synthesized" the input images by cropping a 
real photo and adding noise. In Fig. [51 we represent the evolution of the standard two-frame featureless 
sequential alignment {e.g., [11], [13]) of those images. Note that the fourth image is miss-aligned and how 
that error propagates to the alignment of the remaining images. The (highly incorrect) panorama this way 
obtained, see the bottom right image of Fig. [51 was then used as the initialization for the global method 
we propose in this paper. After few iterations, our algorithm converged to the panoramic image shown in 
Fig. (6[ which is visually indistinguishable from the ground truth image. 

C. Underwater mosaic for seabed mapping 

As a final example, we illustrate our method with automatic mosaic construction from video images, 
captured by an underwater camera in the sea. Although underwater images are particularly difficult to 
align, due to the absence of salient features, the mosaic recovered by our algorithm is visually correct, 
see Fig. [8] 

As a final example, we use images captured by an underwater camera in the sea. Fig. [7J shows four of 
those images. The low texture and the almost total absence of salient feature points make these images 
particularly challenging. Note also that although the overlapping region between images is not very small, 
its shape is not rectangular. In this situation, the traditional fixed window method would use a small 
rectangular window inside the overlapping region, thus failing to use all the information available. 

In Fig. [8] we represent the seabed mosaic obtained by using our algorithm to sequentially register the 
images of Fig. [71 
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VIII. Conclusion 

We proposed a new method to build a panoramic image from a set of partial views. Rather than 
composing the input images in an incremental way, our approach seeks the global solution to the estimation 
problem, i.e., it computes the panorama that best matches all the partial observations. To minimize the 
global cost, we derived an efficient gradient descent algorithm that generalizes the current most robust 
two-frame featureless registration approaches. 
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Fig. 5. Sequential registration. Note how the miss-alignment of the fourth image (middle left) propagates to the remaining ones. 
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Fig. 8. Mosaic built by using our algorithm to register the images of Fig. [7] 



